ABSTRACT


The consensus problem involves an asynchronous system of processes, some of which may be 
unreliable. The problem is for the reliable processes to agree on a binary value. We show that 
every protocol for this problem has the possibility of nontermination, even with only one faulty 
process. By way of contrast, solutions are known for the synchronous case, the “Byzantine 
Generals” problem. 


1. Introduction 

The problem of reaching agreement among remote processes is one of the most fundamental 
problems in distributed computing. It is at the core of many algorithms for distributed data 
processing, distributed file management, and fault-tolerant distributed applications. 


A well-known form of the problem is the “transaction commit problem” which arises in 
distributed database systems [DS1, G, LS, La, Le, Li, R, RLS, S, SS]. The problem is for all the 
data manager processes which have participated in the processing of a particular transaction to 
agree on whether to install the transaction’s results in the database or to discard them. The 
latter action might be necessary, for example, if some data managers were for any reason unable 
to carry out the required transaction processing. Whatever decision is made, all data managers 
must make the same decision in order to preserve the consistency of the database. 


Reaching the type of agreement needed for the “commit” problem is straightforward if the 
participating processes and the network are completely reliable. However, real systems are 
subject to a number of possible faults such as process crashes, network partitioning, and lost, 
distorted or duplicated messages. One can even consider more Byzantine types of failure [DS2, 
DLM, DFFLS, FL, LFF, LSP, PSL] in which faulty processes might go completely haywire, 
perhaps even sending messages according to some malevolent plan. One therefore wants an 
agreement protocol which is as reliable as possible in the presence of such faults. Of course, any 
protocol can be overwhelmed by faults that are too frequent or too severe, so the best that one 
can hope for is a protocol which is tolerant to a prescribed number of “expected” faults. 


In this paper, we show the surprising result that no completely asynchronous consensus 
protocol can tolerate even a single unannounced process death. We do not consider Byzantine 
failures, and we assume that the message system is reliable: it delivers all messages correctly 
and exactly once. Nevertheless, even with these assumptions, the stopping of a single process at 
an inopportune time can cause any distributed commit protocol to fail to reach agreement. 
Thus, this important problem has no robust solution without further assumptions about the 
computing environment or still greater restrictions on the kind of failures to be tolerated! 


Crucial to our proof is that processing is completely asynchronous, that is, we make no 
assumptions about the relative speeds of processes nor about the delay time in delivering a 
message. We also assume that processes do not have access to synchronized clocks, so algorithms 
based on timeouts, for example, cannot be used. (In particular, the solutions in [DS1] are not 
applicable.) Finally, we do not postulate the ability to detect the death of a process, so it is 
impossible for one process to tell whether another has died (stopped entirely) or is just running 
very slowly. 


Our impossibility result applies to even a very weak form of the consensus problem. Assume 
every process starts with an initial value in {0,1}. A nonfaulty process decides on a value in 
{0,1} by entering an appropriate decision state. All nonfaulty processes which decide are 
required to choose the same value. For the purpose of the impossibility proof, we require only 
that some process eventually make a decision. (Of course, any algorithm of interest would 
require that all nonfaulty processes make a decision.) The trivial solution in which, say, 0 is 
always chosen is ruled out by stipulating that both 0 and 1 are possible decision values, although 
perhaps for different initial configurations. 


Our system model is rather strong so as to make our impossibility proof as widely applicable 
as possible. Processes are modelled as automata (with possibly infinitely many states) which 
communicate by means of messages. In one atomic step, a process can attempt to receive a 
message, perform local computation based on whether or not a message was delivered to it and if 
so on which one, and send an arbitrary but finite set of messages to other processes. In 
particular, an “atomic broadcast” capability is assumed, so a process can send the same message 
in one step to all other processes with the knowledge that if any nonfaulty process receives the 
message, then all the nonfaulty processes will. Every message is eventually delivered as long as 
the destination process makes infinitely many attempts to receive, but messages can be delayed 
arbitrarily long and delivered out of order. 


The asynchronous commit protocols in current use all seem to have a “window of 
vulnerability” — an interval of time during the execution of the algorithm in which the delay or 
inaccessibility of a single process can cause the entire algorithm to wait indefinitely. It follows 
from our impossibility result that every commit protocol has such a “window”, confirming a 
widely-believed tenet in the folklore. 


2. Consensus Protocols 

A consensus protocol P is an asynchronous system of N processes (N ≥ 2). Each process p 
has a one-bit input register x_p, an output register y_p with values in {b, 0, 1}, and an unbounded 
amount of internal storage. The values in the input and output registers together with the 
program counter and internal storage comprise the internal state. Initial states prescribe fixed 
starting values for all but the input register; in particular, the output register starts with value b. 
The states in which the output register has value 0 or 1 are distinguished as being decision 
states. p acts deterministically according to a transition function. The transition function 
cannot change the value of the output register once the process has reached a decision state; that 
is, the output register is “write-once”. The entire system P is specified by the transition 
functions associated with each of the processes and the initial values of the input registers. 


Processes communicate by sending each other messages. A message is a pair (p, m), where p 
is the name of the destination process and m is a “message value” from a fixed universe M. The 
message system maintains a multiset, called the message buffer, of messages that have been sent 
but not yet delivered. It supports two abstract operations: 


send(p, m): places (p, m) in the message buffer; 


receive(p): deletes some message (p, m) from the buffer and returns m, in which case we 
say (p, m) is delivered, or returns the special null marker ∅ and leaves the 
buffer unchanged. 


Thus, the message system acts nondeterministically, subject only to the condition that if 
receive(p) is performed infinitely many times, then every message (p, m) in the message buffer is 
eventually delivered. In particular, the message system is allowed to return ∅ a finite number of 
times in response to receive(p) even though a message (p, m) is present in the buffer. 
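

The two buffer operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name MessageBuffer, the constant NULL (standing in for ∅), and the use of a seeded random generator to model the nondeterministic choice are all hypothetical and not part of the model above. A coin flip delivers each pending message with probability 1, which imitates the fairness condition but does not enforce it.

```python
import random
from collections import Counter

NULL = None  # stands in for the null marker; name chosen for this sketch


class MessageBuffer:
    """Multiset of (destination, value) pairs sent but not yet delivered."""

    def __init__(self, seed=0):
        self.pending = Counter()
        self.rng = random.Random(seed)  # models the nondeterministic choice

    def send(self, p, m):
        """Place (p, m) in the buffer."""
        self.pending[(p, m)] += 1

    def receive(self, p):
        """Deliver some pending message for p, or return NULL and change nothing."""
        candidates = [(q, m) for (q, m) in self.pending.elements() if q == p]
        # The system may return NULL even when a message for p is present.
        if not candidates or self.rng.random() < 0.5:
            return NULL
        q, m = self.rng.choice(candidates)
        self.pending[(q, m)] -= 1
        if self.pending[(q, m)] == 0:
            del self.pending[(q, m)]
        return m
```

Repeated calls to receive("p") return the pending values for p in an arbitrary order, interleaved with NULL results, which matches the "delayed arbitrarily long and delivered out of order" behaviour assumed of the message system.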


A configuration of the system consists of the internal state of each process together with the 
contents of the message buffer. An initial configuration is one in which each process starts at 
an initial state and the message buffer is empty. 


A step takes one configuration to another and consists of a primitive step by a single process 
p. Let C be a configuration. The step occurs in two phases. First, receive(p) is performed on 
the message buffer in C to obtain a value m ∈ M ∪ {∅}. Then, depending on p’s internal state 
in C and on m, p enters a new internal state and sends a finite set of messages to other processes. 
Since processes are deterministic, the step is completely determined by the pair e = (p, m), which 
we call an event. (This “event” should be thought of as the receipt of m by p.) e(C) denotes the 
resulting configuration and we say that e can be applied to C. Note that the event (p, ∅) can 
always be applied to C, so it is always possible for a process to take another step. 




A schedule from C is a finite or infinite sequence σ of events which can be applied, in turn, 
starting from C. The associated sequence of steps is called a run. If σ is finite, we let σ(C) 
denote the resulting configuration, which is said to be reachable from C. A configuration 
reachable from some initial configuration is said to be accessible. Hereafter, all configurations 
mentioned are assumed to be accessible. 


The following lemma expresses a “commutativity” property of schedules. 


Lemma 1. Suppose that from some configuration C the schedules σ_1, σ_2 lead to 
configurations C_1, C_2, respectively. If the sets of processes taking steps in σ_1 and σ_2, 
respectively, are disjoint, then σ_2 can be applied to C_1 and σ_1 can be applied to C_2, and 
both lead to the same configuration C_3. (See Figure 1.) 
Figure 1. 


Proof. The result follows at once from the system definition since σ_1 and σ_2 do not interact. □ 
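

The commutativity property can be checked on a small example. The sketch below is a minimal model, not part of the formal development: apply_event, run, and toy_transition are hypothetical names, None stands in for the null marker ∅, and the toy transition function (each step reports to a third process "r") is an assumption chosen only to give the schedules something to do.

```python
from collections import Counter


def apply_event(config, event, transition):
    """Apply event (p, m) to config = (states, buffer); m is None for a null receive."""
    states, buffer = config
    p, m = event
    buffer = Counter(buffer)  # copy, so the input configuration is untouched
    if m is not None:
        if buffer[(p, m)] == 0:
            raise ValueError("event not applicable")
        buffer[(p, m)] -= 1
    new_state, outgoing = transition(p, states[p], m)
    states = dict(states)
    states[p] = new_state
    for dest, value in outgoing:
        buffer[(dest, value)] += 1
    return states, +buffer  # unary plus drops zero counts


def run(config, schedule, transition):
    """Apply a finite schedule, event by event."""
    for event in schedule:
        config = apply_event(config, event, transition)
    return config


def toy_transition(p, state, m):
    # Assumed toy behaviour: the state counts steps; every step reports to "r".
    return state + 1, [("r", f"{p}:{state}")]


c = ({"p": 0, "q": 0, "r": 0}, Counter({("p", "x"): 1, ("q", "y"): 1}))
sigma1 = [("p", "x"), ("p", None)]  # only p takes steps
sigma2 = [("q", "y")]               # only q takes steps
c3_a = run(run(c, sigma1, toy_transition), sigma2, toy_transition)
c3_b = run(run(c, sigma2, toy_transition), sigma1, toy_transition)
assert c3_a == c3_b
```

Because σ_1 and σ_2 touch disjoint processes, neither schedule reads a state or consumes a message that the other one changes, so both orders reach the same configuration.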


A configuration C has decision value v if some process p is in a decision state with y_p = v. 


A consensus protocol is partially correct if it satisfies two conditions: 


1. No accessible configuration has more than one decision value. 


2. For each v ∈ {0, 1}, some accessible configuration has decision value v. 


A process p is nonfaulty in a run provided it takes infinitely many steps, and is faulty 
otherwise. A run is admissible provided at most one process is faulty, and provided all messages 
sent to nonfaulty processes are eventually received. 


A run is a deciding run provided some process reaches a decision state in that run. A 
consensus protocol P is totally correct in spite of one fault if it is partially correct, and every 
admissible run is a deciding run. Our main theorem shows that every partially correct protocol 
for the consensus problem has some admissible run which is not a deciding run. 


3. Main Result 
Theorem I. No consensus protocol is totally correct in spite of one fault. 


Proof. Assume to the contrary that P is a consensus protocol which is totally correct in spite 
of one fault. We prove a sequence of lemmas which eventually lead to a contradiction. 


The basic idea is to show circumstances under which the protocol remains forever indecisive. 
This involves two steps. First, we argue that there is some initial configuration in which the 
decision is not already predetermined. Secondly, we construct an admissible run which avoids 
ever taking a step that would commit the system to a particular decision. 


Let C be a configuration and let V be the set of decision values of configurations reachable 
from C. C is bivalent if |V| = 2. C is univalent if |V| = 1, let us say 0-valent or 1-valent 
according to the corresponding decision value. By the total correctness of P, and the fact that 
there are always admissible runs, V ≠ ∅. 
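

For a protocol with finitely many reachable configurations, the set V can be computed by exhaustive search. The sketch below does this for a toy leader-based protocol that is an assumption of this example, not a protocol from the paper: every process reports its input to process 0, which decides on the first value it receives and announces it. The protocol is partially correct but not tolerant of a fault in the leader. All function names are hypothetical, and None stands in for both the undecided output value b and the null marker ∅.

```python
from collections import deque

N = 2  # process 0 plays the (assumed) leader role


def transition(p, state, m):
    """Deterministic step of process p in state (x, y, started) on receipt of m."""
    x, y, started = state
    out = []
    if not started:
        started = True
        out.append((0, ("val", x)))  # report own input to the leader
    if m is not None:
        kind, v = m
        if y is None and (kind == "dec" or (kind == "val" and p == 0)):
            y = v  # write-once output register
            if kind == "val":
                out.extend((q, ("dec", v)) for q in range(1, N))
    return (x, y, started), out


def successors(config):
    """All configurations e(C) for events e applicable to C."""
    states, buffer = config
    for p in range(N):
        deliverable = {msg for (dest, msg) in buffer if dest == p}
        for m in [None, *sorted(deliverable)]:
            rest = list(buffer)
            if m is not None:
                rest.remove((p, m))
            new_state, out = transition(p, states[p], m)
            new_states = states[:p] + (new_state,) + states[p + 1:]
            yield new_states, tuple(sorted(rest + out))


def decision_values(config):
    """The set V of decision values of configurations reachable from config."""
    seen, queue, values = {config}, deque([config]), set()
    while queue:
        c = queue.popleft()
        values |= {y for (_, y, _) in c[0] if y is not None}
        for nxt in successors(c):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return values


def initial(inputs):
    """Initial configuration: undecided, not started, empty buffer."""
    return tuple((x, None, False) for x in inputs), ()
```

With inputs (0, 1) the leader may receive either value first, so decision_values(initial((0, 1))) is {0, 1} and the configuration is bivalent; with inputs (0, 0) or (1, 1) the initial configuration is univalent.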


Lemma 2. P has a bivalent initial configuration. 


Proof. Assume not. Then P must have both 0-valent and 1-valent initial configurations by 
the assumed partial correctness. Let us call two initial configurations adjacent if they differ only 
in the initial value x_p of a single process p. Any two initial configurations are joined by a chain 
of initial configurations, each adjacent to the next. Hence, there must exist a 0-valent initial 
configuration C_0 adjacent to a 1-valent initial configuration C_1. Let p be the process in whose 
initial value they differ. 


Now consider some admissible deciding run from C_0 in which process p takes no steps, and let 
σ be the associated schedule. Then σ can be applied to C_1 also, and corresponding configurations 
in the two runs are identical except for the internal state of process p. It is easily shown that 
both runs eventually reach the same decision value. If the value is 1, then C_0 is bivalent; 
otherwise, C_1 is bivalent. Either case contradicts the assumed nonexistence of a bivalent initial 
configuration. □ 
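

The chain argument in the first paragraph of the proof can be made concrete. In the sketch below, initial configurations are represented only by their input vectors, and valency is a caller-supplied function assumed to return 0 or 1 for every vector (that is, the case in which no initial configuration is bivalent). The function name adjacent_pair and the example valency rule are hypothetical.

```python
def adjacent_pair(valency, start, end):
    """Walk from start to end flipping one input at a time and return the
    first adjacent pair whose valencies differ, with the index of the
    process whose input changed; return None if no such pair is met."""
    current = list(start)
    for i, bit in enumerate(end):
        if current[i] != bit:
            nxt = current.copy()
            nxt[i] = bit
            if valency(tuple(current)) != valency(tuple(nxt)):
                return tuple(current), tuple(nxt), i
            current = nxt
    return None


# Assumed example rule: the outcome is 1 only if every input is 1.
pair = adjacent_pair(lambda xs: int(all(xs)), (0, 0, 0), (1, 1, 1))
# pair == ((1, 1, 0), (1, 1, 1), 2)
```

Since the two ends of the chain have different valencies, some single flip along the way must change the valency, and the process at that index plays the role of p in the proof.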


Lemma 3. Let C be a bivalent configuration of P, and let e = (p, m) be an event 
which is applicable to C. Let 𝒞 be the set of configurations reachable from C without 
applying e, and let 𝒟 = e(𝒞) = {e(E) | E ∈ 𝒞 and e is applicable to E}. Then 𝒟 
contains a bivalent configuration. 


Proof. Since e is applicable to C, then by definition of 𝒞 and the fact that messages can be 
delayed arbitrarily, e is applicable to every E ∈ 𝒞. 


Now assume that 𝒟 contains no bivalent configurations, so every configuration D ∈ 𝒟 is 
univalent. We proceed to derive a contradiction. 


Let E_i be an i-valent configuration reachable from C, i = 0, 1. (E_i exists since C is bivalent.) 
If E_i ∈ 𝒞, let F_i = e(E_i) ∈ 𝒟. Otherwise, e was applied in reaching E_i, and so there exists F_i ∈ 𝒟 
from which E_i is reachable. In either case, F_i is i-valent since F_i is not bivalent (by assumption) 
and one of E_i and F_i is reachable from the other. Since F_i ∈ 𝒟, i = 0, 1, 𝒟 contains both 
0-valent and 1-valent configurations. 


Call two configurations neighbors if one results from the other in a single step. By an easy 
induction, there exist neighbors C_0, C_1 ∈ 𝒞 such that D_i = e(C_i) is i-valent, i = 0, 1. Without 
loss of generality, C_1 = e'(C_0) where e' = (p', m'). 

CASE 1: If p' ≠ p, then D_1 = e'(D_0) by Lemma 1. This is impossible since any successor of 
a 0-valent configuration is 0-valent. (See Figure 2.) 


CASE 2: If p' = p, then consider any finite deciding p-free run from C_0 (that is, one in which 
p takes no steps) with corresponding schedule σ, and let A = σ(C_0). By Lemma 1, σ is applicable 
to D_i, and it leads to an i-valent configuration E_i = σ(D_i), i = 0, 1. Also by Lemma 1, 
e(A) = E_0 and e(e'(A)) = E_1. (See Figure 3.) Hence, A is bivalent, which is impossible since 
the run to A is deciding, so A must be univalent. 


In each case, we reached a contradiction, so 𝒟 contains a bivalent configuration. □ 


Any deciding run from a bivalent initial configuration goes to a univalent configuration, so 
there must be some single step which goes from a bivalent to a univalent configuration. Such a 
step determines the eventual decision value. We now show that it is always possible to run the 
system in a way that avoids such steps, leading to an admissible non-deciding run. 


The run is constructed in stages, starting from an initial configuration. We ensure that the 
run is admissible in the following way. A queue of processes is maintained, initially in an 
arbitrary order, and the message buffer in a configuration is ordered according to the time the 
messages were sent, earliest first. Each stage consists of one or more process steps. The stage 
ends with the first process in the process queue taking a step in which, if its message queue was 
not empty at the start of the stage, its earliest message is received. This process is then moved 
to the back of the process queue. In any infinite sequence of such stages every process takes 
infinitely many steps and receives every message sent to it. The run is therefore admissible. Our 
problem of course is to do this in such a way as to avoid a decision ever being reached. 
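

The fairness discipline alone, without the bivalence-preserving steps that precede the final event of each stage, can be sketched as follows. The function name forced_events is hypothetical, the buffer is assumed to be a list of (destination, value) pairs already ordered by send time, and None stands in for the null marker ∅. The sketch also ignores messages sent during the run, which a full construction would append to the buffer.

```python
from collections import deque


def forced_events(process_queue, buffer, stages):
    """Yield, for each stage, the event (p, m) that must end it: the process
    at the head of the queue receives its earliest pending message, or takes
    a null step if it has none, and then moves to the back of the queue."""
    queue = deque(process_queue)
    for _ in range(stages):
        p = queue[0]
        m = next((msg for (dest, msg) in buffer if dest == p), None)
        yield p, m
        if m is not None:
            buffer.remove((p, m))  # the earliest message for p is delivered
        queue.rotate(-1)           # p goes to the back of the process queue


events = list(forced_events(["a", "b", "c"],
                            [("b", "m1"), ("a", "m2"), ("b", "m3")], 6))
# events == [("a", "m2"), ("b", "m1"), ("c", None),
#            ("a", None), ("b", "m3"), ("c", None)]
```

Every process reaches the head of the queue once in every N stages, and each pending message is the earliest for its destination after finitely many such visits, which is why any infinite sequence of stages is admissible.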


Let C_0 be a bivalent initial configuration whose existence is assured by Lemma 2. Execution 
begins in C_0, and we ensure that every stage begins from a bivalent configuration. Suppose then 
that configuration C is bivalent and that process p heads the process queue. Let m be the 
earliest message to p in C’s message buffer, if any, and ∅ otherwise. Let e = (p, m). By Lemma 
3, there is a bivalent configuration C’ reachable from C by a schedule in which e is the last event 
applied. The corresponding sequence of steps defines the stage. 


Since each stage ends in a bivalent configuration, every stage in the construction of the 
infinite schedule succeeds. The resulting run is admissible, and no decision is ever reached. It 
follows that P is not totally correct. □ 


4. Initially Dead Processes 

In this section, we exhibit a protocol which solves the consensus problem for N processes as 
long as a majority of the processes are nonfaulty and no process dies during the execution of the 
protocol. No process knows in advance, however, which of the processes are initially dead and 
which are not. 


The protocol works in two stages. During the first stage, the processes construct a directed 
graph G with a node corresponding to each process. Every process broadcasts a message 
containing its process number and initial value and then listens for messages from L − 1 other 
processes, where L = ⌈(N + 1)/2⌉. G has an edge from i to j iff j receives a message from i. 
Thus, G has indegree L − 1. 


In the second stage, the processes construct G⁺, the transitive closure of G, in the sense that 
upon completion of this stage, each process k knows about all of the edges (j, k) incident on k in 
G⁺. Each k also knows the initial values of all such j. After k discovers such an edge in G⁺, we 
say that k knows about that edge and about the node j. 


The computation of G⁺ is carried out in the following way. First, each process broadcasts to 
all other processes its process number and initial value together with the names of the L − 1 
processes it heard from during the first stage. It then waits until it has received both the stage 1 
and stage 2 messages from all its ancestors in G which it knows about. It initially knows only 
about the L − 1 processes from which it heard directly during the first stage, but as it receives 
stage 2 messages, it may discover additional ancestors. Waiting continues until such time as all 
currently known about processes have been heard from. 


At this point, each process knows all of its own ancestors and the edges of G incident on 
them, so it can compute all of the edges of G⁺ incident on each of its ancestors. This enables it 
to determine which of its ancestors belong to an initial clique of G⁺, that is, a clique with no 
incoming edges, for node k is in an initial clique iff k is itself an ancestor of every one of its 
ancestors. Since every node in G⁺ has at least L − 1 predecessors, there can be only one initial 
clique, it has cardinality at least L, and every process which completes the second stage knows 
exactly the set of processes comprising it. 


Finally, each process makes a decision based on the initial values of the processes in the initial 
clique using any agreed-upon rule. Since all processes know the initial values of all members of 
the initial clique, they all reach the same decision. 
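

The local computation each process performs once its waiting is over can be sketched as follows. This is an illustration under stated assumptions, not the protocol's message handling: heard_from is a hypothetical dictionary mapping each live process j to the set of L − 1 processes it heard from in the first stage (the incoming edges of j in G), and the decision rule min is just one possible agreed-upon rule.

```python
import math


def majority_threshold(n):
    """L = ceil((N + 1) / 2), the size of a strict majority of N processes."""
    return math.ceil((n + 1) / 2)


def ancestors(heard_from, k):
    """All j with an edge (j, k) in the transitive closure of G."""
    found, stack = set(), list(heard_from[k])
    while stack:
        j = stack.pop()
        if j not in found:
            found.add(j)
            stack.extend(heard_from[j])
    return found


def initial_clique(heard_from, k):
    """Ancestors of k that are themselves ancestors of every one of their ancestors."""
    return {j for j in ancestors(heard_from, k)
            if all(j in ancestors(heard_from, a) for a in ancestors(heard_from, j))}


def decide(inputs, clique):
    """Assumed rule for this sketch: the minimum initial value in the clique."""
    return min(inputs[j] for j in clique)


# N = 5, process 4 initially dead, so L = 3 and each live process hears from 2 others.
heard_from = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {0, 1}}
inputs = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}
decisions = {k: decide(inputs, initial_clique(heard_from, k)) for k in heard_from}
# Every live process computes the clique {0, 1, 2} and decides 0.
```

In this example process 3 is not in the initial clique, since no one heard from it in the first stage, yet it computes the same clique as the others and therefore reaches the same decision.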


The correctness of this protocol proves the following theorem. 


Theorem II. There is a partially correct consensus protocol in which all nonfaulty 
processes always reach a decision, provided no processes die during its execution and a 
strict majority of the processes are alive initially. 